Nomilo Fishpond Biogeochemical Analysis
  • Correlational Analysis
  • Fieldwork Templates

Data Analysis Workflow:

  • Install Packages
  • Load Libraries
  • Import Raw Data
    • Procedure
    • View Raw Data
  • Tidy Raw Data
    • Tidying Processes
    • Merge Tidied Datasets
    • Export Tidied Datasets
  • Data Dictionary
  • Exploratory Data Analysis
  • Correlational Analysis

Nomilo Fishpond Biogoechemical Analysis

Interactive Code

Throughout this document, hover over the numbered annotations to the right of code chunks to reveal detailed explanations and comments about the code. Where drop-down italicized text is present, expand by pressing on arrow to see code.

Install Packages

install.packages(c("rio", "tidyverse", "janitor", "lubridate", "rmarkdown", "fs"))

Load Libraries

library(rio)
library(tidyverse)
library(janitor)
library(lubridate)
library(rmarkdown)
library(fs)
1
For importing excel data
2
For cleaning of data
3
For cleaning variable names
4
For cleaning dates
5
For displaying tables
6
For file path usage
<<<<<<< HEAD =======

Get Raw Data File Names

```{bash}
#| label: get-raw-data-file-names
cd "data/raw"
ls
```
1
Change directory to “data/raw”
2
List files within directory
2024-02-28_dfs.RData
2024-02-28_ksf-clam-growth.xlsx
2024-02-28_ksf-compiled-data.xlsx
2024-02-28_ksf-oyster-cylinder-growth.xlsx
2024-02-28_profile-data.xlsx
2024-02-28_water-samples.xlsx
2024-02-28_weather-data.xlsx
2024-03-01_dfs_no_profiles.RData
~$2024-02-28_profile-data.xlsx
~$2024-02-28_water-samples.xlsx
>>>>>>> f79388ba582c3a45484705a0a0dd8a6de8037a5c

Import Raw Data

Procedure

Define vector of files to import:

<<<<<<< HEAD
files_to_import <- dir_ls(path = "data/raw")

for (i in seq_along(files_to_import)) {
  cat(i, "= ", files_to_import[i], "\n")
}
=======
```{r}
#| label: create-vector-file-paths
files_to_import <- c(
  "data/raw/2024-02-28_ksf-clam-growth.xlsx",
  "data/raw/2024-02-28_ksf-compiled-data.xlsx",
  "data/raw/2024-02-28_ksf-oyster-cylinder-growth.xlsx",
  "data/raw/2024-02-28_profile-data.xlsx",
  "data/raw/2024-02-28_water-samples.xlsx",
  "data/raw/2024-02-28_weather-data.xlsx"
)
```

Use the purrr::map() function to iteratively import all files in the files_to_import vector:

```{r}
sheets_to_import <- c("L1", "L2", "L3", "L4")
profiles_data <- profiles_data <- map_dfr(sheets_to_import, function(sheet_name) {
  import(files_to_import[4], which = sheet_name)
}) %>% 
  bind_rows()
```
```{r}
#| label: iteratively-import-raw-data
#| messages: false
#| warnings: false

dfs_no_profiles <- map(files_to_import[c(1:3, 5, 6)], import_list)

save(dfs_no_profiles, file = "data/raw/2024-03-01_dfs_no_profiles.RData")
```
>>>>>>> f79388ba582c3a45484705a0a0dd8a6de8037a5c
1
<<<<<<< HEAD Store the file paths of our raw data within the data/raw directory in files_to_import
2
Print each file path with its index ======= It takes a while to import these dataframes, so we’re going to only import. it once and then export it as a .RData file to load it back in more efficiently >>>>>>> f79388ba582c3a45484705a0a0dd8a6de8037a5c
1 =  data/raw/2024-02-28_dfs.RData 
2 =  data/raw/2024-02-28_ksf-clam-growth.xlsx 
3 =  data/raw/2024-02-28_ksf-compiled-data.xlsx 
4 =  data/raw/2024-02-28_ksf-oyster-cylinder-growth.xlsx 
5 =  data/raw/2024-02-28_profile-data.xlsx 
6 =  data/raw/2024-02-28_water-samples.xlsx 
7 =  data/raw/2024-02-28_weather-data.xlsx 
8 =  data/raw/2024-03-01_dfs-no-profiles.RData 
9 =  data/raw/~$2024-02-28_profile-data.xlsx 
10 =  data/raw/~$2024-02-28_water-samples.xlsx 

Use the purrr::map() function to iteratively import files in the files_to_import vector except for the profiles data and .RData files:

@iteratively-import-raw-data Code Chunk Execution Warning

The @iteratively-import-raw-data code chunk should only be ran once when raw data is updated because it takes long to execute. Therefore, run the @efficiently-load-raw-data code chunk instead to easily import up-to-date raw data.

dfs_no_profiles <- map(files_to_import[c(2:4, 6, 7)], import_list)
current_date <- format(Sys.Date(), "%Y-%m-%d")
save(dfs_no_profiles, file = paste0("data/raw/", current_date, "_dfs-no-profiles.RData"))

Efficiently import up-to-date raw data:

<<<<<<< HEAD
load(files_to_import[8])
=======
```{r}
#| label: efficiently-load-raw-data
# load("data/raw/2024-03-dfs_no_profiles.RData") 
```
>>>>>>> f79388ba582c3a45484705a0a0dd8a6de8037a5c

Rename datasets:

We will always use snakecase when naming our data objects and functions (e.g., data_object_name or function_name()).

<<<<<<< HEAD
names(dfs_no_profiles) <- gsub("data/raw/2024-02-28_|\\.xlsx$|\\.xls$", "", 
                               files_to_import[c(2:4, 6, 7)])
names(dfs_no_profiles) <- gsub("-", "_", names(dfs_no_profiles))
names(dfs_no_profiles)
1
Remove prefixes and file extensions
2
Replace hyphens with underscores
3
Check if names were outputted correctly =======
```{r}
#| label: rename-raw-datasets
names(dfs_no_profiles) <- gsub("data/raw/2024-02-28_|\\.xlsx$|\\.xls$", "", files_to_import[c(1:3, 5, 6)])
names(dfs_no_profiles) <- gsub("-", "_", names(dfs_no_profiles))
names(dfs_no_profiles)
```
1
Remove prefixes and file extensions
2
Replace hyphens with underscores
3
Check if names were outputted correctly >>>>>>> f79388ba582c3a45484705a0a0dd8a6de8037a5c
[1] "ksf_clam_growth"            "ksf_compiled_data"         
[3] "ksf_oyster_cylinder_growth" "water_samples"             
[5] "weather_data"              

Rename each sheet within each raw dataset to be lowercased and replace spaces with underscores:

<<<<<<< HEAD
dfs_no_profiles <- map(dfs_no_profiles, ~ set_names(.x, gsub(" ", "_", tolower(names(.x)))))
=======
```{r}
#| label: rename-sheets-within-each-raw-dataset
dfs_no_profiles <- map(dfs_no_profiles, ~ set_names(.x, gsub(" ", "_", tolower(names(.x)))))
```
>>>>>>> f79388ba582c3a45484705a0a0dd8a6de8037a5c

Create separate datasets by specifying the Excel sheet from each spreadsheet we want to tidy:

<<<<<<< HEAD
ksf_clam_growth_data <- dfs_no_profiles$ksf_clam_growth$sheet1
ksf_compiled_data <- dfs_no_profiles$ksf_compiled_data$full_data
ksf_oyster_cylinder_growth_data <- dfs_no_profiles$ksf_oyster_cylinder_growth$sheet1
water_samples_data <- dfs_no_profiles$water_samples$data_overview
weather_data <- dfs_no_profiles$weather_data$weather_ksf

We want to combine multiple sheets within the profiles Excel spreadsheet into one, therefore, we will import it separately:

sheets_to_import <- c("L1", "L2", "L3", "L4")

profiles_data <- profiles_data <- map_dfr(sheets_to_import, function(sheet_name) {
  import(files_to_import[5], which = sheet_name)
}) %>%
  bind_rows()
1
[code annotation]
2
[code annotation]
3
[code annotation]
=======
```{r}
#| label: create-separate-datasets
ksf_clam_growth_data <- dfs_no_profiles$ksf_clam_growth$sheet1
ksf_compiled_data <- dfs_no_profiles$ksf_compiled_data$full_data
ksf_oyster_cylinder_growth_data <- dfs_no_profiles$ksf_oyster_cylinder_growth$sheet1
water_samples_data <- dfs_no_profiles$water_samples$data_overview
weather_data <- dfs_no_profiles$weather_data$weather_ksf
```
>>>>>>> f79388ba582c3a45484705a0a0dd8a6de8037a5c

View Raw Data

  • ksf_clam_growth_data
  • ksf_compiled_data
  • ksf_oyster_cylinder_growth_data
  • water_samples_data
  • weather_data
  • profiles_data
<<<<<<< HEAD

Tidy Raw Data

Tidying Processes

  • ksf_clam_growth_data_tidied
  • ksf_compiled_data_tidied
  • ksf_oyster_cylinder_growth_data_tidied
  • water_samples_data_tidied
  • weather_data_tidied
  • profile_data_tidied
Steps to clean data <<<<<<< HEAD
new_var_names <- c(
  "sort_date", "color", "clams_in_count", "clams_in_lbs", "clams_in_avg_per_lb", 
  "clams_out_count", "clams_out_lbs", "clams_out_avg_per_lb", "growth_in_lbs",
  "growth_pct", "sr", "days_btwn_sort"
  )

new_date_col <- c(
  "2023-10-17", "2023-12-06", "2023-12-12", "2024-01-02", "2024-01-10", "2024-01-24", 
  "2024-01-31", "2024-02-08", "2024-02-13"
  ) 

ksf_clam_growth_data_tidied <- ksf_clam_growth_data %>%
  slice(-1) %>%
  setNames(new_var_names) %>%
  mutate(date = as.Date(new_date_col)) %>% 
  dplyr::select(-sort_date) %>% 
   pivot_longer(
    cols = c(
      clams_in_count, clams_in_lbs, clams_in_avg_per_lb, clams_out_count, 
      clams_out_lbs, clams_out_avg_per_lb
      ),
    names_to = c("stage", ".value"),
    names_prefix = "clams_", 
    names_sep = "_", 
    values_to = "value"
  ) %>%
  mutate(stage = if_else(str_detect(stage, "in"), "In", "Out")) %>%
  rename(avg_per_lbs = avg) %>% 
  mutate(across(c(color, stage), as.factor)) %>%
  mutate(across(c(count, lbs, avg_per_lbs, growth_in_lbs, growth_pct, sr), 
                ~as.numeric(gsub("%", "", .)))) %>% 
  arrange(date, color, stage) %>% 
  dplyr::select(date, days_btwn_sort, color, stage, count, lbs, avg_per_lbs, 
                growth_in_lbs, growth_pct, sr)

paged_table(ksf_clam_growth_data_tidied)
Steps to clean data
ksf_compiled_data_tidied <- ksf_compiled_data %>% 
  rename_with(~gsub("\\s*\\([^\\)]+\\)", "", .x)) %>%
  janitor::clean_names() %>%
  rename(date = date_time) %>%
  mutate(date = as.Date(date)) %>%
  filter(date >= as.Date("2023-11-20") & date <= as.Date("2024-02-20")) %>%
  arrange(date) %>%
  dplyr::select(-c(external_voltage, wk_num, wind_dir,
                   spadd, outdoor_temperature, hourly_rain,
                   solar_radiation, resistivity, battery_capacity,
                   hour, daynum, data_pt, wind_sp, diradd,
                   wind_speed, wind_direction, tide, day, month, year)
                ) %>%
  dplyr::select(where(~ !anyNA(.))) %>%
  group_by(date) %>%
  summarise(across(where(is.numeric), \(x) mean(x, na.rm = TRUE)))

paged_table(ksf_compiled_data_tidied)
=======
```{r}
#| label: tidy-ksf-compiled-data-dataset
#| code-fold: true
#| code-summary: <i>Steps to clean data</i>
ksf_compiled_data_tidied <- ksf_compiled_data %>% 
  rename_with(~gsub("\\s*\\([^\\)]+\\)", "", .x)) %>%
  janitor::clean_names() %>%
  rename(date = date_time) %>%
  mutate(date = as.Date(date)) %>%
  filter(date >= as.Date("2023-11-20") & date <= as.Date("2024-02-20")) %>%
  arrange(date) %>%
  dplyr::select(-c(external_voltage, wk_num, wind_dir,
                   spadd, outdoor_temperature, hourly_rain,
                   solar_radiation, resistivity, battery_capacity,
                   hour, daynum, data_pt, wind_sp, diradd,
                   wind_speed, wind_direction, tide, day, month, year)
                ) %>%
  dplyr::select(where(~ !anyNA(.))) %>%
  group_by(date) %>%
  summarise(across(where(is.numeric), \(x) mean(x, na.rm = TRUE)))

paged_table(ksf_compiled_data_tidied)
```
>>>>>>> f79388ba582c3a45484705a0a0dd8a6de8037a5c
1
<<<<<<< HEAD Clean variable names by removing everything in parentheses, using lowercase and underscores in place of spaces ======= Clean variable names by removing everything in parentheses, using lowercase and underscores in place of spaces >>>>>>> f79388ba582c3a45484705a0a0dd8a6de8037a5c
2
<<<<<<< HEAD Rename the date_time variable to date, filter to desired date range and sort by date ======= Rename the date_time variable to date, filter to desired date range and sort by date >>>>>>> f79388ba582c3a45484705a0a0dd8a6de8037a5c
3
<<<<<<< HEAD Remove unnecessary variables ======= Remove unnecessary variables >>>>>>> f79388ba582c3a45484705a0a0dd8a6de8037a5c
4
<<<<<<< HEAD Remove columns with containing all NA values ======= Remove columns with containing all NA values >>>>>>> f79388ba582c3a45484705a0a0dd8a6de8037a5c
5
<<<<<<< HEAD Group by date and calculate the average of every variable for each day ======= Group by date and calculate the average of every variable for each day >>>>>>> f79388ba582c3a45484705a0a0dd8a6de8037a5c
Steps to clean data <<<<<<< HEAD
oyster_var_names <- c(
  "date", "oyster_large_weight", "oyster_large_gain", "oyster_small_weight",
  "oyster_small_gain", "oyster_chlorophyll"
  )

ksf_oyster_cylinder_growth_data_tidied <- ksf_oyster_cylinder_growth_data %>% 
  dplyr::select(c(1, 4, 5, 8, 9, 12)) %>%
  slice(-1) %>%
  setNames(oyster_var_names) %>%
  pivot_longer(
    cols = c(oyster_large_weight, oyster_large_gain,
             oyster_small_gain,
             oyster_small_weight),
    names_to = c("oyster_size", ".value"),
    names_prefix = "oyster_",
    names_sep = "_",
    values_to = "value"
  ) %>%
  mutate(oyster_size = if_else(str_detect(oyster_size, "small"), "Small", "Large")) %>%
  mutate(oyster_size = as.factor(oyster_size),
         across(c(weight, gain), as.numeric)
        ) %>%
  filter(date >= as.Date("2023-11-20") & date <=
           as.Date("2024-02-14"))

paged_table(ksf_oyster_cylinder_growth_data_tidied)
=======
```{r}
#| label: tidy-ksf-oyster-cylinder-growth-data-dataset
#| code-fold: true
#| code-summary: <i>Steps to clean data</i>

oyster_var_names <- c("date", "oyster_large_weight", "oyster_large_gain", "oyster_small_weight", "oyster_small_gain", "oyster_chlorophyll")

ksf_oyster_cylinder_growth_data_tidied <- ksf_oyster_cylinder_growth_data %>% 
  dplyr::select(c(1, 4, 5, 8, 9, 12)) %>%
  slice(-1) %>%
  setNames(oyster_var_names) %>%
  pivot_longer(
    cols = c(oyster_large_weight, oyster_large_gain,
             oyster_small_gain,
             oyster_small_weight),
    names_to = c("oyster_size", ".value"),
    names_prefix = "oyster_",
    names_sep = "_",
    values_to = "value"
  ) %>%
  mutate(oyster_size = if_else(str_detect(oyster_size, "small"), "Small", "Large")) %>%
  mutate(oyster_size = as.factor(oyster_size),
         across(c(weight, gain), as.numeric)
        ) %>%
  filter(date >= as.Date("2023-11-20") & date <=
           as.Date("2024-02-14"))

paged_table(ksf_oyster_cylinder_growth_data_tidied)
```
>>>>>>> f79388ba582c3a45484705a0a0dd8a6de8037a5c
1
<<<<<<< HEAD Manually set variable names ======= Manually set variable names >>>>>>> f79388ba582c3a45484705a0a0dd8a6de8037a5c
2
<<<<<<< HEAD Select desired columns and remove first row ======= Select desired columns and remove first row >>>>>>> f79388ba582c3a45484705a0a0dd8a6de8037a5c
3
<<<<<<< HEAD Convert from wide to long format ======= Convert from wide to long format >>>>>>> f79388ba582c3a45484705a0a0dd8a6de8037a5c
4
<<<<<<< HEAD Create a new variable that differentiates oyster size ======= Create a new variable that differentiates oyster size >>>>>>> f79388ba582c3a45484705a0a0dd8a6de8037a5c
5
<<<<<<< HEAD Adjust data types to numeric and factor ======= Adjust data types to numeric and factor >>>>>>> f79388ba582c3a45484705a0a0dd8a6de8037a5c
6
<<<<<<< HEAD Filter to desired date range ======= Filter to desired date range >>>>>>> f79388ba582c3a45484705a0a0dd8a6de8037a5c

Address the NA values before merge

Steps to clean data <<<<<<< HEAD
water_samples_data_tidied <- water_samples_data %>%
  slice(-c(44:52)) %>% 
  rename_with(~gsub("\\s*\\([^\\)]+\\)", "", .x)) %>%
  janitor::clean_names() %>%
  mutate(
    date = if_else(date == "44074", 
                   as.character(as.Date("2024-01-09")),
                   format(dmy(date), "%Y-%m-%d")) 
  )

paged_table(water_samples_data_tidied)
=======
```{r}
#| label: tidy-profile-data-dataset
#| code-fold: true
#| code-summary: <i>Steps to clean data</i>
profiles_data_tidied <- profiles_data

paged_table(profiles_data_tidied)
```
>>>>>>> f79388ba582c3a45484705a0a0dd8a6de8037a5c

Done: - Date format - Date number conversion

The reason why it wasn’t working was because the date column was of a character data type, therefore we can easily convert 44074 but we were assuming that R knew the other dates were of a date type. So to fix this, we first needed to convert it into a date in the original format of DD-MM-YYYY, then convert it to our desired format of YYYY-MM-DD.

Need to do: - Names clean up - Add numbers to sample ID

Steps to clean data <<<<<<< HEAD
weather_data_tidied <- weather_data

paged_table(weather_data_tidied)
=======
```{r}
#| label: tidy-water-samples-data-dataset
#| code-fold: true
#| code-summary: <i>Steps to clean data</i>
water_samples_data_tidied <- water_samples_data %>%
   rename_with(~gsub("\\s*\\([^\\)]+\\)", "", .x)) %>%
  janitor::clean_names() %>%
   mutate(
    date = case_when(
      grepl("^\\d{2}/\\d{2}/\\d{4}$", date) ~ ymd(date),
     date == "44074" ~ as.Date("2024-01-09"),
     TRUE ~ as.Date(date, format = "%Y-%m-%d")
    )
  )

paged_table(water_samples_data_tidied)
```

Date format Date number conversion Names clean up Add numbers to sample ID

Steps to clean data
```{r}
#| label: tidy-weather-data-dataset
#| code-fold: true
#| code-summary: <i>Steps to clean data</i>
weather_data_tidied <- weather_data

paged_table(weather_data_tidied)
```
>>>>>>> f79388ba582c3a45484705a0a0dd8a6de8037a5c
Steps to clean data
profiles_data_tidied <- profiles_data

paged_table(profiles_data_tidied)

Merge Tidied Datasets

Export Tidied Datasets

Export tidied datasets to CSV into data/tidied folder:

source("code/functions/export_to_csv.R")

dfs_to_export <- list(
  ksf_clam_growth_data_tidied = ksf_clam_growth_data_tidied,
  ksf_compiled_data_tidied = ksf_compiled_data_tidied,
  ksf_oyster_cylinder_growth_data_tidied = ksf_oyster_cylinder_growth_data_tidied
)

imap(dfs_to_export, ~ export_to_csv(.x, .y, "data/tidied"))
1
List of dataframes we want to export as CSV files
2
Iterate the export_to_csv(df, df_name, dir_path) function over each dataframe. .x refers to the dataframe. .y refers to the name of the dataframe. These are passed to export_to_csv() function along with the desired directory path.

Export merged final data set into data/outputs folder.

Data Dictionary

Exploratory Data Analysis

Correlational Analysis

Back to top